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Abstract 

We consider Bayesian online static parameter estimation for state-space models. 

This is a very important problem, but is very computationally challenging as the state- 
of-the art methods that are exact, often have a computational cost that grows with the 
time parameter; perhaps the most successful algorithm is that of SMC 2 [9]. We present 
a version of the SMC 2 algorithm which has computational cost that does not grow with 
the time parameter. In addition, under assumptions, the algorithm is shown to provide 
consistent estimates of expectations w.r.t. the posterior. However, the cost to achieve 
this consistency can be exponential in the dimension of the parameter space; if this 
exponential cost is avoided, typically the algorithm is biased. The bias is investigated 
from a theoretical perspective and, under assumptions, we find that the bias does not 
accumulate as the time parameter grows. The algorithm is implemented on several 
Bayesian statistical models. 

Keywords: State-Space Models; Bayesian Inference; Sequential Monte Carlo. 


1 Introduction 

We consider a state-space models, that is, a pair of discrete-time stochastic processes, 
{X n } n>0 and {Y n } n>v The hidden process {X n } n>0 is a Markov chain and the joint 
density for n observations of the hidden process and the observations is 

n 

M*o) II fo(xk\xk-i)go{yk\xk) 

k =1 

where Xk G X, k > 0, and yk G Y, k > 1, 9 G 0 C is a static parameter with prior ng. In 
particular, it is of interest to infer the posterior on (9,x o :n ), conditional upon (yi,... ,y n ) 
{xq : n = (xo, ■ ■ ■ ,x n )) as the time parameter n grows. This problem is of interest in a wide 
variety of applications, including econometrics, finance and engineering; see for instance [5]. 

In general, even if 9 is fixed, the posterior cannot be computed exactly and one often has 
to resort to numerical methods, for example by using sequential Monte Carlo (SMC) (see 
e.g. [16]). SMC makes use of a collection of proposal densities and sequentially simulates 
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from these N > 1 samples, termed particles. In most scenarios it is not possible to use 
the distribution of interest as a proposal. Therefore, one must correct for the discrepancy 
between proposal and target via importance weights. In the majority of cases of practical 
interest, the variance of these importance weights increases with algorithmic time. This 
can, to some extent, be dealt with via a resampling procedure consisting of sampling with 
replacement from the current weighted samples and resetting them to 1/N. However, as is 
well known in the literature, due to the path degeneracy problem for particle filters, when 6 
is a random variable SMC methods do not always work well; see the review of [20] for details. 
This has lead to a wide variety of techniques being developed, including [9, 11, 17, 18, 21]; 
see [20] for a full review. 

The method which one might consider to be the state-of-the-art for the Bayesian online 
static parameter estimation for state-space models, is that in [9]. This approach combines 
the methods of SMC samplers [14] and particle Markov chain Monte Carlo (PMCMC) [1]. 
The method provides consistent estimates (that is, as the number of samples grows) of 
expectations w.r.t. the posterior on (0, xo :n ), as the time parameter grows. The method has 
been shown to work well in practice, but has one major issue; the computational cost of 
the application of PMCMC kernels grows as the time parameter grows; whilst the amount 
of times that the application of the kernel may stablize, one will still need to apply the 
kernel during the algorithm (although a variety of tricks can be used to reduce the cost; see 
[19]). We present a version of the SMC 2 algorithm which has computational cost that does 
not grow with the time parameter. In addition, under assumptions, the algorithm is shown 
to provide consistent estimates of expectations w.r.t. the posterior. However, the cost to 
achieve this consistency can be exponential in the dimension of the parameter space; if this 
exponential cost is avoided, typically the algorithm is biased. The bias is investigated from 
a theoretical perspective and, under assumptions, we find that the bias does not accumulate 
as the time parameter grows. 

Our approach is based upon breaking up the observations into blocks of T observations; 
an approach used, differently, in other articles such as [8, 11, 18]. The idea is simply to 
develop a ‘perfect’ SMC 2 algorithm which has computational cost that does not grow with 
the time parameter; in practice, one cannot implement this algorithm and so one must 
approximate it. Our approach simply uses the approximation of an appropriate target from 
the previous block. It is remarked that if the posterior exhibits concentration/Bernstein- 
Von Mises properties as the time parameter grows, then alternative schemes are relevant, 
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which could be prefered to the ideas here; however, such properties do not always hold. 
See for instance the work of [15] in the context of MLE for relatively weak conditions; the 
properties of the MLE can impact on the concentration/Bernstein-Von Mises properties of 
the posterior (see e.g. [4]). 

In Section 2, SMC 2 is reviewed and our algorithm is presented. In Section 3 our theoret¬ 
ical results are presented, with proofs housed in the appendix. In Section 4 our simulation 
results are presented. In Section 5 the article is concluded and several possible extensions 
are discussed. 


2 Algorithm 

2.1 Notations 

Let (E,£) be a measurable space. The notation Bb(E ) denotes the class of bounded and 
measurable real-valued functions. Cb(E) dentes the continuous, bounded measurable real¬ 
valued functions on E. The supremum norm is written as ||/|| 00 = sup x6B |/(x)|. £?(E) 
is the set of probability measures on (E,£). We will consider non-negative operators K : 
E x £ — > R + such that for each x G E the mapping A K»• K(x,A) is a finite non-negative 
measure on £ and for each A G £ the function x K > K(x, A) is measurable; the kernel K is 
Markovian if K(x,dy) is a probability measure for every x G E. For a finite measure fi on 
(E,£), real-valued and measurable / : E —> K. 

IxK : A H> J K{x,A)n{dx) \ Kf:x J f(y)K(x,dy). 

We also write y(f) = f f(x)/x(dx) ■ II • lltv denotes the total variation distance. 


2.2 SMC 2 

SMC 2 uses SMC to sample the following sequence of targets, for a fixed N x > 1 

n 1 N x 

MO a ^( 0 )^e(4;nfo a i;n x )(n 

fc=1 x i =1 

with C = {0,xl : .^ ,a 1 1 :^ :c ) G 0 x x {1,..., N x } n = E n , where 




n N x 


®6( x 0m*r a lm X ) = ( IT ^(^o)) II ( II ^ , /e(4kfc-l) 


k— 1 % 


3 



with 3 ( 2 / 0 !*) = 1 for any x (E X. Throughout, we write the normalizing constant of 7 r„ as 
We make the following defintion for notational convenience in the sequel: 



ii 

iS? 


At* 

- (n 


Atr 


rjeivk i|* fc 1 ) f e {x\\xl k _-S)(^r^2ge{yk\x l k )) (l) 


L = A iEf=i^(y fc -il4_i)‘ 
ge(jjk-i\x a k k _ 1 ) 


i=l 


/^(4K-i) • 


( 2 ) 


We provide a Feynman-Kac representation of the SMC 2 algorithm which facilitates the- 

1:Nx a Nx ) 
0:fc+l’ “life+l/ 


oretical analysis and will assist our subsequent notations. Let a k = (0, *o-£i’ a nfe+i) an< f 


set 


AT, 


770 (a 0 ) = 7 r e ( 6 >) (n ^(*0)) (n 

and for any k > 1 


2 — 1 


.90 (go No 1 ) 


fe(x INo 1 ) 


AA, 


M k (a k -i,da k ) = 


M fc (a fc _ l9 da fc )( II ^£^WW*(4+il*? +1 )^,(^) 




where M*, is tt^— invariant (i.e. typically a particle marginal Metropolis-Hastings (PMMH) 
kernel [1]). Then we set 


G k (a k ) = ^^2ge(yk+i\x k+1 ). 


N x 


N, 


and 


n— 1 


7n(^) = / ^(«n) JJ Gp(a fe )77o(ao) JJ M fe (a fe _i,da fe )da 0 . 

^ p=0 fc=l 

The SMC algorithm which approximates the n—time marginal 


Vn{v) = 


7n(<P) 


will have joint law: 


7n(l) 

Y[t] 0 {a z 0 )da z 0 P ( n $ fc(^!)( dl 


N 


N 


a i u 


( 3 ) 

i =1 fc=l \i=l / 

where Q k (fj,)(dx) = y(G k -iM k (dx)) / y(G k -\) is the usual selection-mutation operator and 

V k -i is the empirical measure of the particles. It is easily shown that by approximating 
r/ n one can approximate the posterior on 9 and Xy n given y\ :n \ see [9, 13]. We note that, 
typically one will apply the kernel M k when dynamic resampling is performed; however the 
form of the algorithm is simple to describe with resampling at each time step. 


2.3 Perfect Algorithm 

One issue with the above algorithm is that whenever the kernel M k is applied, one must 
sample trajectories of the hidden states which grow with the time parameter. Thus, even 
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if resampling is dynamically performed, leading to an application of M k , the cost of the 
algorithm will increase with the time parameter. So one can say whilst SMC 2 is a very 
powerful algorithm, it is not an online algorithm. Here we present an ‘ideal’ or perfect 
version of the algorithm that has computational cost which does not grow with time, but 
cannot be implemented in general. This algorithm will provide the basis for our biased 
algorithm in the next section. 

To begin, we set B £ Z + which in principle can grow and T £ Z + which is fixed. The 
parameter T will represent the maximum length of the trajectory of the hidden state, which 
one wants to sample. We define the following target probabilities: 

N x n 

OC TT0 (6>) ( M x oi) II X k-Zk’ ( 4 ) 

i—l k—1 

n£ {0,... ,T} 


and 

/ /} 1 : N x 1 : N x \ 

7T n,bT\Vi x (b-l)T+l:n’ a (b-l)T+2:n' (X 

N x N x n 

(il^’^-ilT+i)) {^-Y^9e{y(b-i)T+i\x\ b _ 1)T+1 ) S j Pk{0,x^. k ,a}; N *) (5) 

i—l x i=1 fc=(6-l)T+2 

n £ {(b — l)T + 1,..., bT}, b £ 


where is as (1) and 

, (6—1)T 

((d,x {b _ 1)T+ i) (xtt e {6) fe{x( b -i) T +i\x(b-i)T)ve{xo) fs{ x k\xk-i)ge{yk\xk)dx 0 . {b _ 1)T . 

•* fc =i 

If one can approximate the targets above one can approximate the posterior on 9 and Xo, n 
given y\ :n . As we will see below, our algorithm to achieve this cannot be implemented in 
practice, but has the benefit that the computational cost per-time step cannot grow beyond 
a given bound. 


2.3.1 Algorithm 


For n £ {0,..., T} one can run the SMC 2 algorithm as in Section 2.2. That is to run the 
algorithm with law (3) until time T — 1. We will add a final time step that will resample 
the N particles according to Gt- i- That is, defining Mt as the Dirac measure, we sample 
from 


N T / N 

Y[y 0 (a l 0 )da l 0 f n$k(Vk-i){dal) 

i—l k=1 V i—l 

For the subsequent blocks, we make the following definitions. Let b £ {2, ...,£?}, 


&(b-i)T+i — (0 > e — E((,_ 1 )t’ +1 , a n 


{9,x 


1:N X 

(6-l)T+l:n+l 


a 


1:N X 

(6-l)T+2:n+l 


)e 
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0 x x(™~( B-1 ) T+1 ) JV * x {1, Nx }&-(b-i)t)n. = E„, n £ {(6 - 1)T + 1,..., bT - 1}, and 
&bT = (0, x (&-l)T+l:bT’ a \b N -i)T+ 2 -.bT) X™* x {1..., N x }^~^ = E bT . Now let 

N x 

G n (a n ) = — Y^ge(y n K) n £ {(6- 1)T + 1,.. .,6T}. 

x -i—1 

Set 

N x 

5 ?(6-l)T+l(0!(6-l)T+l) = (Jt ((^^(b-lJT+l))’ 

j=l 

Define M n ^T,-K rl _ 1 bT as a Markov kernel of invariant density 'Kn-x^T, n £ {(b — 1)T + 
2,..., bT — 1} such as a PMMH kernel. Then define 

bT, 1) du n ) -^n,bT,ir„_i,b T i^n —1 > d6i n ^, \ )/?n+l (0 , (£ n:n _|_i) > ( a n+l ) )^(®ra:n+l) 

where d ra = (d^_ 1; (aJ^+i) 7 , (a^f)') and /3 is as (2). Set M b T,bT,-K bT _ libT as a Dirac mass. 
Finally, for g € &{E n - 1 ), n € {( b — 1)T + 2,..., &T} and probability density ij) on E n -i 


define 


^n,6T,^/> (/^) (dd n ) . 


[l{Cj rl — i JVbn,bT^ (dd n )) 


Then the perfect algorithm has joint law for b £ {2,..., B}, n £ {(b — 1 )T + 1,..., bT} 


N n N 

Jt 7 7(b-l)T+l(d( b _ 1 ) T+1 ) Jt Jt &k„bT,Tr k - ltbT {Vk-i,bT)(da l k ) 

i =1 fc=(6-l)T+2*=l 


where bT is the empirical measure of the particles at time k — 1. 


2.3.2 Remarks 


Set, for b £ {2,...,£?}, n £ {(& — 1)T + 2,..., bT} 


7n,bT{da n ) = _ ]^[ Gfc(dfc)77( & _i)T+i(dd(b_i)T+i)x 

JE (b _ 1)T+1 x-E n ^i k =(b-X)T+X 

n 

I”] Mk,bT^ k _ ltbT {&k-i-,do.k) 

fe=(6-l)T+l 

and fj n ,bT{da n ) = ‘j n ,bT(da n )/ A f ntb T{l), then we note that for ^ : E n _i —> K, ir njb T —integrable 

Vn,bT(G n ip) _ 

Vn,bT(G n ) 

I (/) 1 : N x 1 : N x \ / r\ 1: N x 1: N x \ 1//1 1: iV x \ 

(p(0 , ^( 5 _!)^ +1:n? a (6-l)T+2:n) 7rn ’ 6T (^’ X (b-l)T+l:n’ a (6-l)T+2:n)^(^’ X (6-l)T+l:n) 

E n — i 

so that one can estimate expectations w.r.t. ir n ,bT via 

fln,bAG n y) 

€M 6 n) ' 
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2.4 Approximate Algorithm 

The problem with the previous algorithm is that one can seldom evaluate C(@> x (b-i)T+i) 
nor sample from it perfectly. We introduce the following algorithm, which will sample from 
the following targets. For the first block, one can run the SMC 2 algorithm to target the 
sequence (4). At the subsequent blocks, one is unable to evaluate the target, nor sample 
from the proposals. We propose the following approximate targets to replace (5): 

AT X N 

n n ,bT{0, X (b-l)T+l:n’ a (b-l)T+2:rt) <X ( IT j\f -Av(# — ^ )/fl ( X (b-1)T+1 l X (6-1 )t)) X 

i=l j-l 

^ N x n 

9e{y(b-\)T+\ k(6-i)T+i)) n (7) 

* i—1 fc=(b-l)T+2 

with n G {(& — 1)T + 1,..., bT}, b G {2,..., B}, 0 1:N , sam pl es from the algorithm 

at the previous block, which we shall describe how to obtain and Kn{0 — O') a kernel 
density whose bandwidth may depend on N. At the end of a block of the algorithm, we 
just take 0 l ' N , as the samples we have obtained, taking the first of the N x — tuples 

of Nic (as the samples are exchangeable). Using standard SMC theory, which we will 
expand upon, one can prove that if the bandwidth of K appropriately depends on N that 
at least Tt n ^T(0, o^r+ 2 -n) w ill converge almost surely (in an appropriate sense) to 

^u, 2 t{ 0 , x^}. n , a^+ 2 -n) as N g rows J hence providing the justification of the approximation 
introduced. 

Set 

N x ^ N 

%_l )T+ i(a ( &-i)T+i) = (Y[^^2 K n( 0 ~ 0: ’)fo( x lb-i)T+i\ x lb-i)T)) 

i=1 j=1 

Then the approximate algorithm has joint law for b G {2,..., B}, n G {(& — 1)T + 1,..., bT} 

N n N 

nw + i(^-i )r+ i) n nw.^(Ci,«.)(««;) 

i=1 fe=(b-l)T+2i=l 

where 1 bT is the empirical measure of the particles at time k — 1. An estimate of the 
form (6) can be used to estimate the targets. Note that the cost of the algorithm is not 
0(N 2 ) as one does not need to evaluate r)( fc _ 1 ) T+1 (d(b_i)T+i), even in the PMMH steps. 

2.5 Related Simulation Methods and Alternatives 

Similar, but different, ideas have appeared in several articles including [2, 6, 11]. These 
ideas are considered in the context of hidden Markov models and partially observed point 


7 



processes respectively. The key differences of our work to [6, 11] ([2] is for maximum like¬ 
lihood estimation (MLE)) are as follows. In the context of [6] we do not use a type of 
‘sequential MCMC’, in that our approach can be used explicitly for online Bayesian param¬ 
eter estimation. The approach of [11] is less general, where the particles are not updated 
with PMCMC, and one has a block length of 1. 

Alternatives to the approach outlined above are; (i) a form of sequential PMCMC in the 
spirit of [6] or (ii) to use an over-lapping or sliding window. For (i), one expects that the cost 
is higher than the above algorithm, and online (filtered) estimates are not available. For (ii) 
the cost may be higher, but, in simulation studies, we did not find any obvious improvement 
in practice. We also remark that our approximate algorithm uses kernel density estimation, 
but, in principle, any approximation scheme could be used; this is demonstrated in Section 
4 where the method in [8] is used in place of a kernel density estimate. In practice, the 
algorithm is implemented with dynamic resampling according to the effective sample size. 

3 Theoretical Results 

Throughout, it is supposed that for any n > 1, sup e x ge{y n \x) < +oo. 

3.1 Consistency 

We will now show that, under a specific choice of the bandwidth h and under some mathe¬ 
matical assumptions, the algorithm just presented is consistent. We set K b {0) = h~ d K(h~ 1 9); 
it is supposed that for any fixed 0, lim/^o Kh(9) = 0. For simplicity we denote rj n (x) 
(resp. E„), n £ {0,... ,T} as f) n>T {x) (resp. E n ). 

(Al) We have 

- f e K{9)d0 = 1, K(0) > 0 VO £ 0 

- / \\9\\ 2 K(0)d0 < +oo 

- KeC b {Q). 

(A2) For each n, b, b £ {1,..., B}, n £ {(6 — 1)T + 1,..., bT}, there exists a L n ^ > 0 such 
that for every 9, O' £ 0 

\fln,bT(0) — fln,bT{0')\ < L n ,fc||0 — 0'\\. 

(A3) We have h = N~ =(<*+p, in addition there exist a C > 0 such that for every N 
sup eee \K n {9)\ < C. 



(A4) For any b G {2,..., B}, n G {(b — 1 )T + 1,..., bT}, ip G B(E n ) the function r/ i-A 

M^ bT (y>)(x) is continuous at rj = n n _i^T uniformly in x G E n _i; also sup, ? x Mf) bTri (ip)(x) < 
+ 00 . 

The first three assumptions are from [10] to allow convergence of the SMC estimate of 
the kernel density and the final assumption is from [3]. We set 

1 N 

f l(b-l)T.(b-l)T ( y K N{9)fg{x (b -l)T+l\-)) = JfYl Kn ^ ~ )/»(Z(b-1)T+1 |^ b _ 1)T ) • 

i=i 

Let —denote convergence in probability as N —> oo. We now have the following result 
whose proof is in Appendix A. 

Theorem 3.1. Assume (Al-f). Then for each n,b, b G {1,...,!?}, n G {(b — 1 )T + 

1,..., bT} and ip G B b {E n ) we have 

VntTiv) “bp fin tbT (<p) 

and in particular for every 6 G 0, xph-i^T+i G X 

V(b-l)T,(b-l)T(^N{9)fe(X(b-l) T +l\')) -tP C(^£(b-l)T+l)- 

Remark 3.1. The result here is essentially qualitative. In order to obtain consistency, one 
must adopt an exponential effort in d and so one would only run (with the choice of h as in 
(A3)) over the exact algorithm if the time horizon is long and d is small. In practice one 
will decouple h and N, leading to estimates which are biased, even if N —► oo. We will now 
study the bias. 

3.2 Bias 

We now investigate the asymptotic (in N ) bias of the approach. We make the following 
hypothesis: 

(A5) There exists a 6 G [1, oo) such that for any n > T + 1: 

Gn(x) . , 
sup , < d. 

(a^y+En G n (y) 

There exists an e G (0,1) and for any b G {2,..., B}, n G {(&— l)T+2,..., bT— 1} there 
exists a v G 3?(E. n ) such that for any probability density if on E n -\ (x, y) G Ef l _ 1 

AIn^bT,tp(x , *) + e^I\/I n b x,^p(x, *) A Z/(')^ . 
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We write the TV—limiting version of (7) as TT n ^T,oo (which one can easily prove exists, for 
any b G {2,,B}). For a bounded <p : E bT —> K. we define the bias as 

B {bT, ip) = - E* niW , i00 1 [ip(Xl ^ x , a 1 ^, 9)} . 

We have the following result whose proof is in Appendix B. 

Theorem 3.2. Assume (A5). Then there exist aC < +oo, pi, P 2 € (0,1) such that for any 
b e {2,..., B}: 

BibT^KCMUl-p^+p^). 

Remark 3.2. The result indicates that some of the bias can fall at a geometric rate as T 
grows. Note, that the bias cannot disappear once, one block is wrong (which is by assumption 
in the statement of the theorem). The result also provides the reassuring point that bias’ do 
not accumulate as the number of blocks grow, albeit under strong assumptions. 

4 Simulations 

4.1 Gaussian linear model 

4.1.1 Model and algorithm setting 

We consider the following Gaussian model, 

vo(xo) = </>(a;o;0 ,t 0 -1 ) 

fg{xk\x k - i ) = 4>{x k -,x k - i , t -1 ) 
ge(,Vk\xk) = 4>{Uk\x k , A” 1 ) 

where 4>{x\ p, a 2 ) is the density function of a Normal distribution with mean p and variance 
a 2 . 

The data is simulated from to = r = A = 1, with 10,000 time steps. The algorithms are 
implemented with To fixed to 1 and the other two being estimated. That is, the parameter 
is 9 = ( t , A). 

We consider three algorithms. First, due to the simple structure of the model, we can 
obtain exact evaluation of p(y k \yi-.k-\) through a Kalman filter. We will replace the particle 
filter in the SMC 2 algorithm with the Kalman filter; this results in a regular SMC algorithm. 
This is used to provide accurate unbiased estimators when comparing algorithms. The 
second is the SMC 2 algorithm and the third is the proposed new algorithm, which is termed 
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SMC 2 FW. The PMCMC step in each algorithm is constructed as a two-block Metropolis 
random walk, with a Normal kernel on the logarithm scale. 

With the Kalman filter, the computational cost of p(yk\yi:k-i) is negligible and thus we 
use a large number of 0-particles, N = 10000. This allows us to obtain unbiased posterior 
estimators with very small variance. As a result, this provides a good baseline for the 
comparison of the other two algorithms. 

For the SMC 2 algorithm, we use N = 1000 particles for the SMC algorithm and N x = 
1000 particles. For simplicity, the value of N x is fixed through the time line. 

For the SMC 2 FW algorithm, we set N = 10, 000 and N x = 1000. As we will see later, 
despite the increased number of the 0-particles, the computational cost is still significantly 
lower than SMC 2 , in addition to be upper bounded per time step. The kernel density 
estimate (KDE) is bivariate Normal on the logarithm scale, with covariance matrix taken a 
diagonal form, E = h /2 where I 2 is the identity matrix of rank two. We consider h = 0.01, 
0.05 and 0.25. Three widths of each fixed window T = 125, 500 and 1000 are also considered. 

4.1.2 Results 

In Figure 1 to 3 we show the average of estimates over 30 simulations, given different 
bandwidth h and window width T. The first 125 time steps are cutoff from the graphs, 
during which time the SMC 2 FW algorithm is exactly the same as the SMC 2 algorithm. 

Algorithms labeled with “Kalman” means a Kalman filter is used in place of a particle 
filter and thus exact values of p(yk\yi:k-i) are calculated instead of approximations. In this 
case, the results show the behavior of the SMC 2 FW algorithm when N x —> 00 . However, 
in this particular example, errors introduced by the particle filter approximation is minimal 
as shown in the graphs. 

It is as expected that the longer the window width, the better the SMC 2 FW algorithms 
perform. The choice of the bandwidth h has a more dramatic effect on the performance. 
With h = 0.01 the algorithms give almost exact results even for T = 125. On the other 
hand, with h = 0.25, the errors are numerous. For the state X^. neither the bandwidth nor 
the window width affect the results in any observable way. For the parameters, h = 0.01 
and T = 500 provides the best results. 

In Figure 4 the MSE of the two algorithms are shown, using the results obtained with a 
SMC algorithm using Kalman filters (SMC 2 (Kalman) in previous figures) as an unbiased, 
accurate estimate of the true posterior means. It can be seen that for the state X/ c , the 
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Algorithm - SMC2 - SMC2 (Kalman) - SMC2FW SMC2FW (Kalman) 



Figure 1: Average of estimates of state Xk for the Gaussian linear model. 

results are mixed. For the r parameter the SMC 2 algorithm has a smaller MSE while the 
opposite is true for the A parameter. It shall be noted that, the SMC 2 FW algorithm use 
N = 10, 000 while the SMC 2 algorithm only use N = 1000. However, despite the large 
difference of the numbers of 0-particles, the SMC 2 FW algorithm is still significantly more 
computationally cost efficient. It took about thirty minutes for a single run of the SMC 2 FW 
algorithm under this setting while it took about five hours for the SMC 2 algorithm. More 
importantly, the cost of SMC 2 FW is bounded per time step, and thus it is possible to obtain 
better results than SMC 2 for all parameters while having a bounded, smaller cost in the 
long run. 
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0 2500 5000 7500 1 00000 2500 5000 7500 1 00000 2500 5000 7500 1 0000 
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Figure 2: Average of estimates of parameter r (transition precision) for the Gaussian linear 
model. 


4.2 Levy-driven stochastic model 

4.2.1 Model and algorithm setting 

We consider the Levy-driven stochastic volatility model, applied to 1,000 recent S&P 500 
data from September 17, 2010 to September 8 , 2014. The data are obtained as logarithm 
return of the daily adjusted close price, and then normalized to unity variance. The data is 
plotted in Figure 5. There are considerably much larger volatility at the beginning of the 
series. It becomes much more stable in the middle and slightly larger at the end. 

The model we employed to analyze this data is the same as in [1] and we use the same 
notations and formulation as in that paper. The model has four parameters 9 = ( k , <5, 7 , A) 
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Figure 3: Average of estimates of parameter A (observation precision) for the Gaussian 
linear model. 

and a two-dimensional state X n = (cr 2 (roA), z(nA)) where A is the time interval and in this 
example it set to constant 1. 

We consider three algorithms. First, the SMC 2 algorithm with N = 1000 and N x = 1000. 
Second, the SMC 2 FW algorithm with the same number of particles. The same Normal 
KDE approximation is used as in the last example. Various bandwidths of the kernel were 
considered, and h = 0.01 is chosen. The last, the idea coupled with the parallel particle 
filter algorithm (PPF) [8], instead of the KDE approximation, is considered. The latter 
two algorithms use a window width T = 200. We found that further increasing the window 
width does not improve results in a significant manner. 
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Figure 4: MSE of estimates of parameters and states for the Gaussian linear model. The 
SMC 2 FW algorithm uses h = 0.01 and l = 500. 

The algorithms again use Normal kernels on logarithm scales for the PMMH proposals. 
Using results from the particle filter paper, we believe that n and S are strongly correlated 
and they are updated in one block of PMCMC move while 7 and A are updated individually 
in their own block. The proposal scales are calibrated on-line using the moments estimates 
from the sampler. 

4.2.2 Results 

In Figure 6 we show the average of estimates over 20 simulations. All three algorithms give 
similar results. Compared to results from the SMC 2 algorithm, the SMC 2 FW algorithm 
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Figure 5: S&P 500 daily log-return data. 

give slightly better result than that of the SMC 2 FW-PPF one. 

One feature that was mentioned in Section 3, but not shown clearly in the early simple 
Gaussian linear example, is that the bias of the online algorithms does not accumulate over 
time. For instance, consider the a 2 state, though non-trivial errors can be observed for both 
on-line algorithms in the early time steps, they were not carried on to later times. 

In addition to the estimates, prediction of the variance (square of volatility), cr 2 (nA) 
is also calculated as E[X„|yi : „_i] = E[E[X n |X„_i]|i/i :n _i]. Since the transition density is 
not of a closed form, its expectation is estimated with 100 samples of X n generated for 
each value of X n _i in the particle system. The outer expectation is approximated with the 
particle system at time n — 1. The results for the three algorithms are plotted in Figure 7 
against the squared log-returns. 

4.2.3 Summary 

Through two examples, it is shown that with some careful choice of the KDE bandwidth, or 
using the parallel particle filter algorithm, which does not require this layer of tuning and an 
appropriate window width, it is possible to obtain results competitive to the SMC 2 algorithm 
with only a fraction of computational cost. In the Levy-driven stochastic model example, 
the theoretical feature that the bias does not accumulate over time is demonstrated. 
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Figure 6: Averages of estimates for the Levy-driven stochastic model using S&P 500 data. 

5 Summary 

In this article we have presented a method for Bayesian online static parameter estimation 
for state-space models. Our method is such that the computational cost does not grow with 
the time parameter and moreover, the algorithm can be shown to be consistent, although, 
the cost is then exponential in the dimension of the static parameter. We have additionally 
shown that the asymptotic bias, under strong assumptions, does not grow with time. 

There are several avenues for future work. First is the use of alternative approximation 
schemes in our algorithm; we have relied on kernel density estimation, but there are other 
schemes which could be used. Second, in our work, we have investigated the asymptotic 
bias. However, as is clear in the proofs, then the consecutive blocks are independent and 
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Figure 7: Prediction of variance for the Levy-driven stochastic model using S&P 500 data 

one thus expects that the study of the finite sample bias is significantly more challenging; 
an investigation of this is warranted. 
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A Technical Results: Consistency 

Proof of Theorem 3.1. For n e {0,..., T} the result follows by standard theory; see [10, 12, 

13]. We first consider ij^ T (K N (d)f s (xT+i\-))- Denote by & t the cr—algebra generated by 
the particle system up-to time T and expectations w.r.t. the law of the simulated algorithm 
as E. Then we have 

VT,Ti K N(0)fe(x T+ 1|-)) -<(0,xt+i) = 

^ T (A^(0)/ e (^ T+1 |-))-E[^ T (AW(0)/ e (a ; T + i|-))I^T]+E[^ T (AW(0)/ e (xT+i|-))I^T]-C(0^T+i). 

For the first term on the R.H.S. one has by the (conditional) Marcinkiewicz-Zygmund in¬ 
equality (conditional on the samples are generated independently): 

E[{fj^ T (K N (e)fe(xT + i\-)) -E[^ T (K N (d)f g (x T+1 \-))\^ T }) 2 } < 
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where C' depends on C in (A3); thus T {K^{9) fe{xT+i\-))—K[q^ T (KN(0)fe(xT+i\-))\^T] 

converges to zero in probability. 

Now, for 

E[fi^ T (K N (9)f e (x T+ i\-)WT}-aO,x T+ i) 

E[fj^ T (K N (9)fe(xT+i\-))\'&T] = ^T(VT-i)( K N(d)fg(x T +i\-)), and the denominator con¬ 
verges in probability and by the arguments in [10] [Theorem 4.1] the numerator will con¬ 
verge in probability to the appropriate quantity; that is $t{j1t-i)(Kn{ 9) fe{xT- t-i|’)) ~ 
C(^) a; T+l)IP > —> 0. 

For f/r+i 2 t(^)) this converges in probability to fjr+i, 2 T{<p) by centering by the condi¬ 
tional expectation (given <&r+ 1 ) and applying the (conditional) Marcinkiewicz-Zygmund in¬ 
equality for f)T + i 2 t^P) ~ ^[Vt+i 2 t( ( / 3 )I^t+i]- The term E[r)|f +1 2T ( < p)\^t+i] will converge 
to fjr+i,2T (^) using the conditional i.i.d. property and the fact that fj^ t (Kn{ 0)fg(xT+\\-)) 
converges in probability to ((9, Xt+\)- 
For f]j, +2 2 t(v 3 )) we consider: 

9t+2,2t{v) - ®T+2,2T,-kt+i,2t(Vt+1,2t)( 1 P) + ®T+2,2T,»t+i,it (Vt+1,2t) (v) “ ’?T+2,2t(^)- 

The first term is dealt with via the (conditional) Marcinkiewicz-Zygmund inequality as 
above. For < Ft+2,2T,* t+ i, 2T (^t+i,2t)(¥ 3 ) ~ Vt+2,2t(<P ) as Vt+i,2t(Gt+i) converges in prob¬ 
ability to 57t+ 1 ,2t(G ! t+i) ) we need only consider 

VT+l,2T(GT+lMT+2,2T,i T +i,2T( t P)) ~ VT+1,2 t(Gt+iMt+2,2T,-k t +i,2t (V 3 )) = 

9t+1,2t{Gt+iMt+2 1 2T,tvt+i,2T &)) ~ Vt+1,2t(Gt+iMt+2,2T,tt t +i,2T 

Vt+1,2t{Gt+iMt+2,2T,-kt+i,2T ( V 3 )) _ 9T+1,2t{Gt+iMt+2,2T^t+i,2T ( V 3 )) 

The last term on the R.H.S. converges to zero by the above calculations. So we focus on 
the first term on the R.H.S. we have 

n\vT + i M) - Vt+ 1,2t(,^T-\-iMt-\-2,2T,7Tt+i,2T {v>M < 

CE[\[M T + 2 , 2 T,n T+ 1 : 2 T {^)(o.^ +1 ) _ Mt+ 2 , 2 T, 7 v t+ 1 , 2 t ( ( P){ 0 !t+i)}W 

Let e > 0 be given. By (A4) there exists S > 0 independent of o^+i suc h that for any 
probability density rj with |rj — ttt+i, 2 t\ < S we have that \Mt+ 2 , 2 T,k t+1 2 t (^)(c*t+i) ~ 
TfT+2,2T,7r T+1 ,2T(^)( Q T+i)l < e /2- Consider the event: 

A(N, S) = { \i^T+l,2T — 7TT+1,2t| < <5} • 
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Then 


E[|[M T+ 2,2T,* t+ 1,2 T 0)( a T+ 1) Mt+2,2T,tt t + 1,2 T a T+l)]l] — 

E[\[M T +2,2T,-KT + i,2l-( ( P)( a T+l) ~ M T +2,2T,KT+i,2T( ( P)( a T+l)}\^A(N,6)] + 

E[|[Af T+ 2 , 2 T,* T+ i, 2 T (^)(aT+l) ~ M T +2,2T,ir T+lt 2T( l P)( a T+l)}\^A{N,6y} < 

e/2 + C¥(A(N, S) c ). 

By the convergence in probability of T (KN(d)fg(xT+i\-)) here is an N 0 > 1 such that for 
each N > N 0 we have 2 C ■ P[H(.ZV, (5)°] < |. Hence, for any N > N 0 : 

E [|G'T+l(aT+l)[-^T+2,2T,7r T+1| 2T(¥ , )( Q; T+l) “ Mt+2,2T,v t +i,2t (V 3 ) ( a T+l)] I] < 

and as e > 0 was arbitrary, the term of interest goes to zero in Li; this completes the proof. 
The proof can also be repeated if one considers a Kn(0 — O') as part of the function (the 
argument is almost the same). The proofs at subsequent times follow the above arguments 
and are omitted for brevity. □ 


B Proofs for Bias 


In the context of the proof for the bias, we need only consider one block (as will become 
apparent in the proof), as blocks are independent in the asymptotic bias. In addition, one 
significantly simplify the notations by simply considering two Feynman-Kac formula of T 
steps, with different initial distributions, the same potentials and different Markov kernels 
on measurable spaces (E 0 , £ 0 ), ..., (Et, St)- Thus, for k € {1,2} the two Feynman-Kac 
n— time marginals: 


Vn{dx„) 


InidXn) 

7n(!) 


with 



This corresponds to our case, as the potentials are the same, with the Markov kernels and 
initial distributions different. Our proofs will depend a lot on the Bayes rule, which we now 
recall, for fi G Z?{E p _ i) 


/.t(Gp-i) 

We use the notation := • o i> p+i (ji). q> p> 0 (with the convention when 

p = < 7 , one returns fi). Our assumption (A5) under the modified notation is 
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There exist a S £ [1, oo) such that for every p > 0 

G p (x) 


sup 


< <5. 


x,yeE p G p (y) 

• There exist a e € (0,1) and for T — 1 > p > 1, v £ 3A(E p ) such that for each 
k € {1, 2}, T — 1 > p > 1 every x, y € E p ^i 




Recall that the final Markov kernel is a Dirac measure. 


Proof of Theorem 3.2. We have that, under our modified notations 

B(l» = - ^( ??0 2 )(^)| < |- *k»7o)(¥>)l + I- K(V 2 0 )(V)\ 


By Lemma B.2 


\®t{Vo)(<p) - $t(?7o)MI < 




2\T—1 


Then 


\*t(v 2 o)(v) - = \®t{$t-i{Vo))(v) - * 2 T (*T-Mm<p)\- 

as = d>y(/z)(</?) for any p € &(Et-i ). Now 


$t(^T-i(.Vo))(t) ~ < ^T’( < f ) T-l( ? 7o))( < / 3 ) — 


&T-M S)(Gt-i) 

^T-I^oK^T-IV 1 ) 


[*t-M]){G T - r)-^ T ^l){G T -x)}. 


Then, by application of Proposition B.l along with (A5) it follows that 

UPMoo (5) (l - (1 - Sf- 1 ) + 4<5 4 |M|oo (5) (!-(!- e 2 ) T_1 ) 

which allows one to conclude the proof. □ 

Remark B.l. As one can see from inspection of the proof, the difference in initial distri¬ 
bution does not impact the bound. Moreover, the difference in Markov kernels is controlled, 
leading to a control of the bias; such a latter property is not obvious a priori and needs to 
be proved. As is evident from the proof, it does not matter which block one considers, under 
our assumptions. 
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Proposition B.l. Assume (A5). Then for any y € h?(E 0 ) 

II^t-i(aO — c E > T-i(/ x )lltv < 2(-j) ^1 — (1 — e 2 ) T ^ 

where e,6 are as in (A5). 

Proof. We have the standard telescoping sum, for tp : Et -i -> [0,1] 

~ [^fc,T-i(^fc(M)) _ ^fe+i,T-i (^fe+i(^)) ( t )- 

fc =0 

Now, by Lemma B.2 

- ^l+i.T-i (^fc+i(tO) tv — 

2 (;) 2(1_e2)T ~ < ’~1®* + i(®h(<))-®Li(»iW)|| w . 

By Lemma B.l 

\$l+i(®l(T)) -*k+i(*fc(M))| tv < (1 — e) 

so we have proved that 

T— 1 

ll*k(P+i)iM - Kl,(P+ i)j(^lltv < 2(;) 2 (1 - e) £(1 - e 2 ) T ~ k ~ 2 , 

k=0 

from which one can easily conclude. □ 


Lemma B.l. Assume (A5). Then for any T— 1 > p > 1, fi G Z?(E p _i) and <p : R d —> [0,1] 
we have 

where e is as in (A5). 

Proof. We have 

[$» - *»](<?) = — M 2 \(tp)). 

Now, by (A5), for any x, k one can write 

M k (x, dy) = (1 - e)Rp(x, dy ) + en(dy) 
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The following Lemma just collects some results of [12] into a convenient form for use in 
the above proofs. We make the following defintions for 0 < s < t, k G {1,2}: Q^(x,dy) = 
G s -i(x)M*(x,dy), 


Q s ,t( x P' dxn) 


t 

| | Q q{.X q—l ■> d,X q) 
q=s+l 


and P* t (x,dy) = t (x,dy)/Qg t (l)(x). For a bounded and measurable real-valued function 
ip we denote 

IMIosc = sup |v?(ar)| + sup \ip(x) - <p{y )|. 

x x,y 

Lemma B.2. Assume (A5). Then for any 0<s<t<T— 1 and any p, p G 3P{E S _ i), 
k G {1,2} 

ll*J» - *J»lltv < 2(^) 2 (1 - e 2 ) t ~ s \\p - p|| tv 
where e, 6 are as in (A5). 


Proof. By [12, Theorem 4.3.1] 




IIQL(i)llc 


hiQiti 1 )) A p(Qs,t( i)) 


\\p-p\\u 


where /3(P s fc 4 ) is the Dobrushin coefficient of P s fc t . Then by [12, pp. 142] (fourth displayed 
equation) and (A5) 

/3(P s fc t ) < (1 - e 2 y~ s . 


Note that via (A5) and [7, Lemma 4.1] one has 

QS, t (l)(aQ 6 

xT Qs, t (i)(y) “ e 

and as a result via [12, pp. 138], for any pair of probabilities p, p £ @ > {E S „ 1 ) 

\\Qs,t (l)||osc < 2 f ^^ 2 

P-iQsA 1 )) ^ PiQsA 1 )) ~ 

The result thus follows. □ 
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